Let's be honest: professors vary widely in teaching style (and quality), and students tend to be extremely vocal about their opinions of them. Everyone is looking out for one another; friends want each other to get the best professors they can and to avoid the ones they may not learn as well from. As a result, online platforms have sprung up to house student reviews of professors, the best known being Rate My Professors, which has data on over 1.3 million professors, 7,000 schools, and 15 million ratings. Three students at the University of Maryland, College Park even took the initiative to build their own platform, PlanetTerp, to gather ratings specifically for UMD professors; it includes over 11,000 professors and 16,000 reviews. PlanetTerp has the additional feature of including course grades for each UMD course; as of right now there are nearly 300,000 course grades stored on the site.
Starting in 2013, The Diamondback, UMD's premier independent student newspaper, began publishing an annual salary guide: The Diamondback Salary Guide. The Diamondback Salary Guide displays every university employee's yearly pay in an easily digestible format for all to view. This information is public data provided to The Diamondback by the university itself; The Diamondback simply made this data more accessible to all by posting it on a single website.
The Diamondback Salary Guide states, "[w]e won't tell you what conclusions to draw from these numbers, but it's our hope that they'll give you the information you need to reflect on your own." In this final tutorial, we plan to do just that: compare the salaries and ratings of UMD professors and reflect on our findings. From our own past experiences, we have observed that our favorite professors are not always the ones being paid the highest salaries, so we are interested in the possibility of a correlation between these two attributes. If there is a correlation between professor salary and rating, what is it? And if one exists, can we use it to predict professor salary from student reviews and vice versa?
We hypothesize that there will be a negative correlation between a professor's rating and their yearly salary; in other words, as a professor's rating increases, their yearly salary should decrease. We predict this because tenured professors often retain higher annual salaries even if their teaching quality dips over time.
In order to observe the relationship between professor salary and rating, we collected data from three sources: Diamondback Salary Guide (DBK), Rate My Professors (RMP), and PlanetTerp (PT). DBK was our source of professor salary data, and a combination of RMP and PT was used as our source of professor rating data.
The Diamondback Salary Guide has an undocumented API. However, we were able to learn about it by watching the network requests as we modified parameters on the site, which meant we could programmatically go through all of the pages and pull full data for every year the Salary Guide tracks (2013-2022).
Our scraping code for DBK data can be found in src/scrape_salaries.py
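The exact endpoint and parameter names live in our scraping script, but the paging pattern itself is simple. Here is a minimal sketch of it; the page-fetching function below is a hypothetical stand-in for the real DBK API request:

```python
def fetch_all(fetch_page):
    """Page through an API until it returns an empty page."""
    results, page = [], 1
    while True:
        batch = fetch_page(page)
        if not batch:
            return results
        results.extend(batch)
        page += 1

# Hypothetical fetcher standing in for the real DBK API call:
# pretend the API has exactly 3 pages of one record each
fake_api = lambda page: [{"page": page}] if page <= 3 else []
print(len(fetch_all(fake_api)))  # 3
```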
Rate My Professors also has an undocumented API. We discovered it by inspecting the network requests as we loaded a page of professor reviews, noticing a GraphQL query, and then copying over the query, the authentication header, and the variables we needed to emulate the request locally.
Interestingly, although their API technically requires authentication, the request from the website leaks the necessary Basic authentication header, which is test:test encoded in base64.
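You can verify the leaked header yourself; base64-encoding test:test produces the exact token the site sends:

```python
import base64

# The leaked Basic auth credentials are literally "test:test"
token = base64.b64encode(b"test:test").decode()
print(f"Authorization: Basic {token}")  # Authorization: Basic dGVzdDp0ZXN0
```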
In addition, RMP performs fuzzy searching on their search endpoint, which lets us implement fuzzy name matching while we collect the RMP data. As we were gathering the data from RMP, we fed the RMP API each name in our DBK dataset. If RMP found a match for the name, it returned the appropriate professor rating information. This meant that each professor we had in our RMP dataset was automatically matched with a professor in our DBK dataset as we gathered the RMP professor data.
Our scraping code for RMP data can be found in src/scrape_rmp.py
PlanetTerp was created by UMD students, and its creators were generous enough to document an API to help fellow students use the data available on their website.
Using their /api/v1/professors/ endpoint, we collected a list of every professor PlanetTerp has data on who has taught at UMD (over 11,000!), along with the courses they've taught, their average rating across all of their courses, and all of their reviews, each of which includes the text content, rating, expected grade, and creation time.
Our scraping code for PlanetTerp data can be found in src/scrape_pt.py
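As a rough sketch, paging through the professors endpoint looks something like the following. The limit/offset parameter names here are an assumption based on our reading of PlanetTerp's API docs, so double-check them before relying on this:

```python
from urllib.parse import urlencode

BASE = "https://planetterp.com/api/v1"

def professors_url(limit=100, offset=0):
    # limit/offset paging (parameter names assumed from PlanetTerp's docs)
    return f"{BASE}/professors?{urlencode({'limit': limit, 'offset': offset})}"

print(professors_url(offset=200))
# https://planetterp.com/api/v1/professors?limit=100&offset=200
```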
First, let's import our scraped data. We saved data collected from each of our three sources in their own separate CSV files.
import pandas as pd
import numpy as np
salaries_df = pd.read_csv("./src/data/salaries.csv")
pt_df = pd.read_csv("./src/data/pt_ratings.csv")
rmp_df = pd.read_csv("./src/data/rmp_ratings.csv")
Before we began doing anything with our data, we first needed to clean it up.
The DBK salary guide formats names differently from PlanetTerp and RMP, and contains escaped newlines returned from the API requests. As such, we decided to rearrange the first and last names in order to help our fuzzy search algorithm, and replaced newlines with spaces throughout the dataset. We also converted the salary strings to floats and extracted the school from the department strings.
# Rearrange first and last names to standardize name search in salaries
salaries_df["name"] = salaries_df["employee"].apply(
    lambda x: " ".join(x.split(", ")[::-1])
)
# Replace newlines with spaces
salaries_df["department"] = salaries_df["department"].str.replace("\n", " ").str.strip()
# Convert salaries to floats
salaries_df["salary"] = (
    salaries_df["salary"].replace(r"[\$,]", "", regex=True).astype(float)
)
# Extract school from department data
salaries_df["school"] = salaries_df["department"].str.split("-").str[0]
salaries_df.loc[~salaries_df["department"].str.contains("-"), "school"] = np.nan
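As a quick sanity check, here is what these transforms do to a sample row (values taken from the merged table shown later):

```python
# Name rearrangement: "Last, First" -> "First Last"
to_name = lambda x: " ".join(x.split(", ")[::-1])
assert to_name("Dasgupta, Abhijit") == "Abhijit Dasgupta"

# Salary string -> float
assert float("$167,138.22".replace("$", "").replace(",", "")) == 167138.22

# School extraction: the text before the first "-" in the department
assert "ENGR-Mechanical Engineering".split("-")[0] == "ENGR"
```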
While collecting data from RMP, we noticed something odd about each professor's Overall Quality score: it did not equal the average of that professor's individual quality ratings. When students create a review on RMP, they are asked to score the professor's helpfulness and clarity, and we can see each review's helpfulRating and clarityRating in the API data we collected. The RMP website, however, only displays a single "Quality" score per review, which in the vast majority of cases is the average of the two ((helpfulRating + clarityRating) / 2). Yet after performing a few calculations by hand, we found that a professor's Overall Quality is not the average of each review's Quality score either.
Let's take Clyde Kruskal as an example: at the time of our calculations, RMP gave Kruskal an Overall Quality score of 2.30. However, the average of each review's Quality was 2.14, the average of each review's helpfulRating was 2.11, and the average of each review's clarityRating was 2.16, none of which equal 2.30. It is unclear what is causing this discrepancy. Is RMP factoring in professors' difficulty ratings? How recent each review is? The Overall Quality score remains a black box to us.
Since we do not know how RMP is calculating this score, we chose to average each review’s quality rating and use this value for the average rating, since we know exactly how this is calculated.
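In miniature, the averaging we settled on looks like this (with made-up ratings):

```python
# Hypothetical (helpfulRating, clarityRating) pairs for one professor
reviews = [(4, 5), (2, 3)]

# Per-review Quality is (helpful + clarity) / 2; our rating is the mean of those
rating = sum((h + c) / 2 for h, c in reviews) / len(reviews)
print(rating)  # 3.5
```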
rmp_df = rmp_df[rmp_df["reviews"] != "[]"]
# Snippet from src/scrape_rmp.py
def calculate_ratings(names, rmp_get_ratings):
    df = pd.DataFrame(columns=["name", "rating", "courses", "reviews"])
    for i, name in enumerate(names):
        # Handle cases where someone has middle name(s):
        # if there are any, only use the first and last name
        splitted = name.split(" ")
        if len(splitted) >= 3:
            name = splitted[0] + " " + splitted[-1]
        print(f"getting reviews for {name} {i}/{len(names)}")
        # Unique set of courses taught by a professor
        courses = set()
        # Call function to make a request to the API
        ratings = rmp_get_ratings(name)
        reviews = []
        score = 0
        # Iterate over each review and sum its clarity/helpful ratings
        for rating in ratings:
            data = rating["node"]
            course = data["class"]
            courses.add(course)
            score += data["clarityRating"]
            score += data["helpfulRating"]
            # Create our review object
            reviews.append(
                {
                    "professor": name,
                    "course": course,
                    "review": data["comment"],
                    "rating": data["clarityRating"],
                    "expected_grade": data["grade"],
                    "created": data["date"],
                }
            )
        if len(ratings) != 0:
            # Since we added both ratings per review, divide by 2 * count
            score /= len(ratings) * 2
        else:
            score = 0
        # Append to dataframe
        df.loc[len(df)] = [name, score, list(courses), reviews]
    return df
PlanetTerp has many listings for professors that have zero ratings, which is not helpful in our data exploration. For this reason, we removed all professors from our PT dataset who had no reviews. We also noticed that it was possible for PT to have multiple listings for the same professor (see Madeleine Goh and Madeleine Goh). These duplicate entries are eventually broken out into individual review rows, so we don't need to make any special exceptions for these professors.
# Drop professors without any reviews
pt_df = pt_df[pt_df["reviews"] != "[]"]
To connect a professor's salary to their ratings, we needed a way to match names across the datasets. This proved more difficult than we expected, because professor names were not standardized between the three platforms: sometimes names included middle names, sometimes a middle initial, and sometimes no middle name at all. Occasionally a professor's nickname was listed instead of their full name. With over three thousand professors, we could not possibly match names by hand, so we needed a method to find the best matches between the three datasets automatically. We used fuzzy name matching (also known as approximate string matching), a technique for identifying two strings that are approximately similar but not exactly the same.
We explored two different options for matching professor names from PlanetTerp to the Diamondback Salary Guide. One option we considered was Hello My Name Is (HMNI), a form of fuzzy name matching using machine learning. However, we decided against HMNI because it had not been updated in two years and had trouble running on our version of Python. The next method we tried was fuzzywuzzy/fuzzyset, which also perform fuzzy name matching, but use the Levenshtein distance to calculate similarity between names. The Levenshtein distance between two strings is the number of deletions, insertions, or substitutions required to transform one into the other. We ultimately decided to use fuzzyset to match professor names from PT to DBK because it was faster than fuzzywuzzy and produced more successful, correct matches than HMNI.
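To make the metric concrete, here is a textbook dynamic-programming implementation of the Levenshtein distance (fuzzyset combines n-gram lookup with a Levenshtein-based score internally, so this is only an illustration of the metric, not its code):

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```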
However, name matching did not end there. As previously mentioned, some professors had middle names or middle initials while others did not, which heavily impacted the Levenshtein distance calculations performed by fuzzyset. After a preliminary round of name matching, we noticed that many professors in PT were not being matched to the correct listing in DBK because of the presence/absence of middle names; this was especially true of those with longer middle names. To resolve this, we ran two rounds of fuzzyset name matching: in the first round we attempted to match the entire name, and in the second we matched on only the first and last name. We only ran the second round for professors that were not matched confidently in the first round.
With this addition, we were able to match over 500 previously unmatched professors whose data would have been thrown away without this change (the number of matched professors increased from 2,036 to 2,561). Both rounds considered a name match to be a string match with a confidence value of 0.75 or higher. While this does not 100% guarantee that every pair of names that we match are the same person, this method of name matching was the best that we could do considering that we did not have a validation set to check the accuracy of our fuzzy matching results and there were simply too many professor names for our group to check over every pairing by hand.
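The two-round idea can be seen in miniature with the standard library's difflib standing in for fuzzyset (a simplified sketch, not our actual matcher):

```python
from difflib import SequenceMatcher

def best_match(name, candidates, alpha=0.75, beta=0.75):
    ratio = lambda a, b: SequenceMatcher(None, a, b).ratio()
    # Round 1: match on the full name
    conf, match = max((ratio(name, c), c) for c in candidates)
    if conf <= beta:
        # Round 2: retry with first and last name only
        fl = lambda s: f"{s.split()[0]} {s.split()[-1]}"
        conf, match = max((ratio(fl(name), fl(c)), c) for c in candidates)
    # Only accept sufficiently confident matches
    return match if conf >= alpha else None

print(best_match("John Jacob Jingleheimer Smith", ["John Smith", "Jane Doe"]))
# John Smith
```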
try:
    from cfuzzyset import cFuzzySet as FuzzySet
except ImportError:
    from fuzzyset import FuzzySet

# Return first and last name
def get_fl(s: str):
    sl = s.split()
    return f"{sl[0]} {sl[-1]}"

# Merge datasets fuzzily
def fuzzy_merge(
    d1: pd.DataFrame, d2: pd.DataFrame, fuzz_on="", alpha=0.75, beta=0.75, how="inner"
):
    d1_keys = d1[fuzz_on]
    d2_keys = d2[fuzz_on]
    # Create the corresponding fuzzy set for our keys.
    # We pick the larger keyset to "fuzz" off of for performance and accuracy reasons
    fuzz_left = len(d2_keys.unique()) > len(d1_keys.unique())
    if fuzz_left:
        fuzz = FuzzySet(d2_keys.unique())
        fuzz_fl = FuzzySet(d2_keys.apply(get_fl).unique())
    else:
        fuzz = FuzzySet(d1_keys.unique())
        fuzz_fl = FuzzySet(d1_keys.apply(get_fl).unique())

    # Row helper that grabs the matching name from the fuzzy set
    def fuzzy_match(row):
        key = row[fuzz_on]
        matches = fuzz.get(key)
        match_conf, match_name = matches[0]
        # Beta is our cutoff confidence for doing 2nd round matching w/o middle names
        if match_conf <= beta:
            matches = fuzz_fl.get(key)
            match_conf, match_name = matches[0]
        # Return match if confidence is >= alpha
        return match_name if match_conf >= alpha else None

    # Apply fuzzy match and merge datasets
    if fuzz_left:
        d1["_fuzz"] = d1.apply(fuzzy_match, axis=1)
        return pd.merge(d1, d2, left_on="_fuzz", right_on=fuzz_on, how=how).rename(
            columns={"_fuzz": fuzz_on}
        )
    else:
        d2["_fuzz"] = d2.apply(fuzzy_match, axis=1)
        return pd.merge(d1, d2, left_on=fuzz_on, right_on="_fuzz", how=how).rename(
            columns={"_fuzz": fuzz_on}
        )
# Merge PlanetTerp <-> DBK salaries
merge_pt = fuzzy_merge(pt_df, salaries_df, fuzz_on="name", how="inner")
merge_pt.head()
| courses | average_rating | type | reviews | name_x | slug | name | year | employee | department | division | title | salary | name_y | school | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ['ENME674', 'ENMA300', 'ENME684', 'ENME489Z', ... | 5.0 | professor | [{'professor': 'Abhijit Dasgupta', 'course': '... | Abhijit Dasgupta | dasgupta_abhijit | Abhijit Dasgupta | 2013 | Dasgupta, Abhijit | ENGR-Mechanical Engineering | A. James Clark School of Engineering | Prof | 167138.22 | Abhijit Dasgupta | ENGR |
| 1 | ['ENME674', 'ENMA300', 'ENME684', 'ENME489Z', ... | 5.0 | professor | [{'professor': 'Abhijit Dasgupta', 'course': '... | Abhijit Dasgupta | dasgupta_abhijit | Abhijit Dasgupta | 2014 | Dasgupta, Abhijit | ENGR-Mechanical Engineering | A. James Clark School of Engineering | Prof | 183580.92 | Abhijit Dasgupta | ENGR |
| 2 | ['ENME674', 'ENMA300', 'ENME684', 'ENME489Z', ... | 5.0 | professor | [{'professor': 'Abhijit Dasgupta', 'course': '... | Abhijit Dasgupta | dasgupta_abhijit | Abhijit Dasgupta | 2015 | Dasgupta, Abhijit | ENGR-Mechanical Engineering | A. James Clark School of Engineering | Prof | 190895.40 | Abhijit Dasgupta | ENGR |
| 3 | ['ENME674', 'ENMA300', 'ENME684', 'ENME489Z', ... | 5.0 | professor | [{'professor': 'Abhijit Dasgupta', 'course': '... | Abhijit Dasgupta | dasgupta_abhijit | Abhijit Dasgupta | 2016 | Dasgupta, Abhijit | ENGR-Mechanical Engineering | A. James Clark School of Engineering | Prof | 190895.40 | Abhijit Dasgupta | ENGR |
| 4 | ['ENME674', 'ENMA300', 'ENME684', 'ENME489Z', ... | 5.0 | professor | [{'professor': 'Abhijit Dasgupta', 'course': '... | Abhijit Dasgupta | dasgupta_abhijit | Abhijit Dasgupta | 2017 | Dasgupta, Abhijit | ENGR-Mechanical Engineering | A. James Clark School of Engineering | Prof | 198038.26 | Abhijit Dasgupta | ENGR |
# Merge rmp_df <-> DBK salaries
merge_rmp = fuzzy_merge(rmp_df, salaries_df, fuzz_on="name", how="inner")
merge_rmp.head()
| name_x | rating | courses | reviews | name | year | employee | department | division | title | salary | name_y | school | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pamela Abshire | 3.333333 | ['ENEE419A', 'ENEE408D'] | [{'professor': 'Pamela Abshire', 'course': 'EN... | Pamela A. Abshire | 2013 | Abshire, Pamela A. | ENGR-Electrical & Computer Engineering | A. James Clark School of Engineering | Assoc Prof | 82872.96 | Pamela A. Abshire | ENGR |
| 1 | Pamela Abshire | 3.333333 | ['ENEE419A', 'ENEE408D'] | [{'professor': 'Pamela Abshire', 'course': 'EN... | Pamela A. Abshire | 2013 | Abshire, Pamela A. | ENGR-Institute for Systems Research | A. James Clark School of Engineering | Assoc Prof | 55149.36 | Pamela A. Abshire | ENGR |
| 2 | Pamela Abshire | 3.333333 | ['ENEE419A', 'ENEE408D'] | [{'professor': 'Pamela Abshire', 'course': 'EN... | Pamela A. Abshire | 2013 | Abshire, Pamela A. | UGST-Honors College | Undergraduate Studies | Lecturer | 5000.00 | Pamela A. Abshire | UGST |
| 3 | Pamela Abshire | 3.333333 | ['ENEE419A', 'ENEE408D'] | [{'professor': 'Pamela Abshire', 'course': 'EN... | Pamela A. Abshire | 2014 | Abshire, Pamela A. | ENGR-Electrical & Computer Engineering | A. James Clark School of Engineering | Assoc Prof | 82427.95 | Pamela A. Abshire | ENGR |
| 4 | Pamela Abshire | 3.333333 | ['ENEE419A', 'ENEE408D'] | [{'professor': 'Pamela Abshire', 'course': 'EN... | Pamela A. Abshire | 2014 | Abshire, Pamela A. | ENGR-Institute for Systems Research | A. James Clark School of Engineering | Assoc Prof | 66496.05 | Pamela A. Abshire | ENGR |
After individually merging the PT and RMP data with the DBK salaries, we extracted all the reviews for each professor across the two platforms and created a dataframe with one row per review:
from ast import literal_eval
import os

# Cache the reviews, since we do some heavy, GPU-intensive sentiment analysis later
if not os.path.exists("./src/data/reviews.csv"):
    reviews_df = []
    # Combine our individually merged dfs and group by name
    for name, rows in pd.concat([merge_pt, merge_rmp]).groupby("name"):
        for rs in map(literal_eval, rows["reviews"].unique()):
            for r in rs:
                reviews_df.append({**r, "professor": name})
    reviews_df = pd.DataFrame(reviews_df)
    # Drop expected_grade because that doesn't exist for PT
    reviews_df = reviews_df.drop(columns=["expected_grade"])
    # Replace name column w/ professor, which is our DBK name
    reviews_df = reviews_df.rename(columns={"professor": "name"})
    # Fix datetimes
    reviews_df["created"] = pd.to_datetime(reviews_df["created"].str.replace("UTC", ""))
    # Get year of created
    reviews_df["year"] = pd.DatetimeIndex(reviews_df["created"]).year
    # NOTE: This is a placeholder for later num_reviews calculations -- it should be 1 for each row
    reviews_df["num_reviews"] = 1
else:
    reviews_df = pd.read_csv("./src/data/reviews.csv", lineterminator="\n", index_col=0)
reviews_df.head()
| name | course | review | rating | created | year | num_reviews | |
|---|---|---|---|---|---|---|---|
| 0 | A W. Kruglanski | PSYC489H | DO NOT TAKE PSYC489H "Motivated Social Cogniti... | 2 | 2015-09-07 18:44:00+00:00 | 2015 | 1 |
| 1 | A.U. Shankar | CMSC412 | Lectures are pretty dry and difficult to follo... | 3 | 2013-01-02 21:32:00+00:00 | 2013 | 1 |
| 2 | A.U. Shankar | CMSC412 | Professor: He does have a stutter, but if you ... | 3 | 2012-12-23 03:51:00+00:00 | 2012 | 1 |
| 3 | A.U. Shankar | CMSC412 | This is a horrible class. The projects are imp... | 1 | 2012-10-29 00:54:00+00:00 | 2012 | 1 |
| 4 | A.U. Shankar | CMSC412 | I have a lot of respect for Dr. Shankar. He is... | 5 | 2012-05-24 13:00:00+00:00 | 2012 | 1 |
Here's the resulting breakdown of our dataset:

| Column | Description |
|--------|-------------|
| name | Fuzzy matched DBK name of the professor |
| course | The course that the review was written for |
| review | Contents of the review |
| rating | Rating for the professor on a scale of 1-5 |
| created | Datetime for when the review was written |
| year | Year in which the review was written |
| num_reviews | Used later to count the number of reviews (1 for the moment) |
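The num_reviews placeholder pays off once we aggregate: summing a column of 1s during a groupby counts the reviews per group. A toy illustration (with made-up professors):

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["A. Prof", "A. Prof", "B. Prof"],
    "rating": [5, 3, 4],
    "num_reviews": [1, 1, 1],  # placeholder: one per review row
})
# Summing the 1s counts reviews; averaging gives each professor's rating
counts = toy.groupby("name", as_index=False).agg(
    {"num_reviews": "sum", "rating": "mean"}
)
print(counts)
```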
After matching DBK salaries to PT and RMP ratings, we created a preliminary graph to visualize the data that we had tirelessly toiled to collect, tidy, and match.
For this first graph, we plotted every professor who has at least one review on either Rate My Professors or PlanetTerp and at least one entry in the Diamondback Salary Guide, using their average rating and their most recently posted salary.
import plotly.io as pio
import matplotlib.pyplot as plt
# Styles & use plotly as our backend
pd.options.plotting.backend = "plotly"
pio.templates.default = "plotly_dark"
plt.style.use("dark_background")
# Group by each professor and year -- then sum their salaries
# This makes it so that we have a single salary record for each name + year
salaries_df = salaries_df.groupby(["name", "year"], as_index=False).agg(
{
"school": "first",
"salary": "sum",
}
)
# Match each review with the corresponding salary record for that name + year
merged_all_years_all_reviews = reviews_df.merge(salaries_df, on=["name", "year"], how="left")
# Match all reviews with the latest name + year record
merged_last_year_all_reviews = (
reviews_df.merge(salaries_df, on=["name", "year"], how="outer")
.sort_values("year", ascending=False)
.groupby("name", as_index=False)
.agg(
{
"school": "first",
"salary": "first",
"name": "first",
"year": "first",
"num_reviews": "sum",
"rating": "mean",
}
)
)
# Multi-use labels for all plots
labels = {
"rating": "Average Rating (1 to 5)",
"salary": "Salary (US Dollars)",
"num_reviews": "Number of Reviews",
"school": "School",
"sentiment": "Sentiment (-1 to 1)",
}
# Plot all professors w/ their latest salary as the y value, and average rating as their x value
merged_last_year_all_reviews.plot(
kind="scatter",
x="rating",
y="salary",
hover_data=["name", "year", "num_reviews"],
trendline="ols",
trendline_color_override="orange",
title="Average Rating vs. Most Recent Salary",
labels=labels,
)
Looking at this preliminary graph, we noticed a large concentration of points on the lines x = 1.0, 2.0, 3.0, 4.0, and 5.0. These concentrations come from the large number of professors on PlanetTerp who have only a single review, so their "average" rating is necessarily a whole number. After seeing this, we decided to filter out professors with very few reviews. This reduces the size of our dataset, but it also reduces the number of one-off extremely high/low reviews that might otherwise skew our data.
# Same as above, but we only take professors w/ >= 10 reviews
merged_last_year_all_reviews[merged_last_year_all_reviews["num_reviews"] >= 10].plot(
kind="scatter",
x="rating",
y="salary",
hover_data=["name", "year", "num_reviews"],
trendline="ols",
trendline_color_override="orange",
title="Average Rating vs. Most Recent Salary (professors with at least 10 reviews)",
labels=labels,
)
This looks much better. 👍👍👍
Using these datapoints, let's label each point by the school the professor is in.
# Same as above, but we color code by school
merged_last_year_all_reviews[merged_last_year_all_reviews["num_reviews"] >= 10].plot(
kind="scatter",
x="rating",
y="salary",
color="school",
hover_data=["name", "year", "num_reviews", "school"],
title="Average Rating vs. Most Recent Salary (colored by School)",
labels=labels,
)
It seems like there is an overwhelming number of CMNS professors on this scatterplot. Out of curiosity, let's also take a look at how many reviews professors receive in each school.
# Group by name and sum up the number of reviews
merged_all_years_all_reviews.groupby("name", as_index=False).agg(
{
"num_reviews": "sum",
"school": "first",
"salary": "first",
"year": "first",
"name": "first",
}
).plot(kind="box", x="school", y="num_reviews", color="school", hover_data=["name"], labels=labels)
merged_all_years_all_reviews.groupby("school")["num_reviews"].sum().sort_values(
ascending=False
).head(10)
school
CMNS    5860
ARHU    2709
BSOS    2054
BMGT    1135
ENGR    1112
AGNR     541
SPHL     380
INFO     297
JOUR     214
UGST     160
Name: num_reviews, dtype: int64
CMNS professors have the most reviews. This makes intuitive sense: CMNS students (CMSC students in particular) are likely on the internet more often and more technologically savvy, and thus review more of their professors. The large number of CMNS reviews could also simply reflect that there are more CMNS professors than professors in other schools; however, ARHU has the second-most professors but not the second-most reviews.
Next, let's graph the professors for each department on a separate graph to see if the ratings vs. salaries for each department follow a similar trend.
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import statsmodels.formula.api as smf

temp = merged_last_year_all_reviews[merged_last_year_all_reviews["num_reviews"] >= 10]
# Create subplots
fig = make_subplots(
    rows=(len(temp["school"].unique()) // 2),
    cols=2,
    subplot_titles=sorted(temp["school"].unique()),
)
# Create subplot for each school
for i, (school, school_df) in enumerate(temp.groupby("school")):
    # Drop rows w/o salaries
    school_df = school_df.dropna(subset=["salary"])
    # Skip if school has no datapoints
    if len(school_df) == 0:
        continue
    row = (i // 2) + 1
    col = (i % 2) + 1
    # Plot rating vs. salary
    fig.add_trace(
        go.Scatter(x=school_df["rating"], y=school_df["salary"], mode="markers"),
        row=row,
        col=col,
    )
    # Plot regline
    model = smf.ols(formula="salary ~ rating", data=school_df).fit()
    y_hat = model.predict(school_df["rating"])
    fig.add_trace(
        go.Scatter(
            x=school_df["rating"], y=y_hat, mode="lines", line=dict(color="#ffe476")
        ),
        row=row,
        col=col,
    )
    # Update axes
    fig.update_xaxes(title_text=labels["rating"], row=row, col=col)
    fig.update_yaxes(title_text=labels["salary"], row=row, col=col)
# Update size + title + disable legend
fig.update_layout(
    height=3000,
    width=1200,
    title_text="Rating vs. Salary Per Department",
    showlegend=False,
)
# Normalize our axes
fig.update_xaxes(range=[0.8, 5.2])
fig.update_yaxes(range=[0, temp["salary"].max()])
# Show figure
fig.show()
As we can see, each department looks quite different; there does not appear to be any shared trend. However, some departments have significantly fewer data points than others (some do not even have the more than one point needed to fit a linear regression).
More analysis needs to be done. Currently, we are plotting any professor that has at least one rating and at least one salary, which does not account for salaries that have changed over time due to changes in position, inflation, pay raises, etc. Before we make any assumptions about the general trend of the data, let's take time into account.
# Create subplot
fig = make_subplots(
    rows=5,
    cols=2,
    start_cell="top-left",
    subplot_titles=[f"Rating vs. Salary for {i}" for i in range(2013, 2023)],
)
# Create subplot for each year
i = 0
for year, year_df in merged_all_years_all_reviews.groupby("year"):
    # Calculate average rating and # of reviews for each professor
    year_df = year_df.groupby("name", as_index=False).agg(
        {
            "rating": "mean",
            "school": "first",
            "salary": "first",
            "year": "first",
            "num_reviews": "sum",
        }
    )
    # Filter by at least 3 reviews and drop professors w/o any salaries
    year_df = year_df[year_df["num_reviews"] >= 3]
    year_df = year_df.dropna(subset=["salary"])
    # Skip empty years
    if len(year_df) == 0:
        continue
    row = (i // 2) + 1
    col = (i % 2) + 1
    # Plot rating vs. salary
    fig.add_trace(
        go.Scatter(x=year_df["rating"], y=year_df["salary"], mode="markers"),
        row=row,
        col=col,
    )
    # Plot regline
    model = smf.ols(formula="salary ~ rating", data=year_df).fit()
    # Output coefficients
    print(str(year) + " Slope: " + str(model.params[1]))
    y_hat = model.predict(year_df["rating"])
    fig.add_trace(
        go.Scatter(
            x=year_df["rating"], y=y_hat, mode="lines", line=dict(color="#ffe476")
        ),
        row=row,
        col=col,
    )
    # Update labels
    fig.update_xaxes(title_text=labels["rating"], row=row, col=col)
    fig.update_yaxes(title_text=labels["salary"], row=row, col=col)
    i += 1
# Update size + title + disable legend
fig.update_layout(
    height=1500, width=1200, title_text="Rating vs. Salary Per Year", showlegend=False
)
# Normalize x axis
fig.update_xaxes(range=[0.8, 5.2])
fig.show()
2013 Slope: -13864.290727440639
2014 Slope: -2577.7150769865366
2015 Slope: 1849.6050756652894
2016 Slope: -4653.105928150357
2017 Slope: -8476.70450319327
2018 Slope: -2126.2029707391066
2019 Slope: -1068.2108372486932
2020 Slope: -7154.679293679653
2021 Slope: -7917.6360369754475
2022 Slope: -4878.901763837461
Now that we've separated our data by year, there appears to be a slight negative correlation between average professor rating and salary. The only exception is 2015, where the slope of the regression line is positive, though still very shallow.
The steepest negative slopes occur in 2013, 2017, 2020, and 2021. While the circumstances of 2013 and 2017 are unclear, it is likely that students were dissatisfied with the quality of teaching during the height of the COVID-19 pandemic (2020 and 2021), as professors were adjusting to new methods of online teaching, resulting in many extremely positive and extremely negative student reviews. Those extreme reviews would in turn affect the steepness of the linear regression line.
The first thing we learned in this class was that creating word clouds impresses anyone. For this reason, we obviously had to make a word cloud of the most common words used in professor reviews. We tried to remove the most common school/college related words: class, course, lecture, lectures, professor, student, students, exam, exams, test, and tests.
from wordcloud import WordCloud, STOPWORDS
# Shared config for each wordcloud
kwargs = {
"background_color": "black",
"max_font_size": 40,
"scale": 3,
"colormap": "Set2",
"stopwords": set(
[
"class",
"course",
"lecture",
"professor",
"student",
"students",
"exam",
"exams",
"test",
"tests",
"lectures",
]
)
| STOPWORDS,
}
# Collect review words
words = merged_all_years_all_reviews["review"].str.cat(sep=" ")
# Generate wordcloud
wc = WordCloud(**kwargs).generate(words)
# Show figure
plt.figure(figsize=(15, 10))
plt.imshow(wc)
plt.axis("off")
plt.show()
Admittedly, this word cloud shows us nothing. HOWEVER, it looks pretty neat. Let's separate the "good" reviews from the "bad" reviews and see if the most common words differ drastically. We define a "good" review as a review with a rating above 3 stars, and a "bad" review as a review with a rating under 3 stars. Let's also make them into turtles to show our school pride.
# GOOD REVIEWS
from PIL import Image
# Turtle :)
mask = np.array(Image.open("src/img/turtle.jpg"))
# Collect words (for reviews w/ rating above 3)
words = merged_all_years_all_reviews[merged_all_years_all_reviews["rating"] > 3][
    "review"
].str.cat(sep=" ")
# Generate wordcloud
wc = WordCloud(mask=mask, **kwargs).generate(words)
# Show figure
plt.figure(figsize=(15, 10))
plt.imshow(wc)
plt.axis("off")
plt.show()
# BAD REVIEWS
# Turtle :(
mask = np.array(Image.open("src/img/upsidedown_turtle.jpg"))
# Collect words (for reviews w/ rating below 3)
words = merged_all_years_all_reviews[merged_all_years_all_reviews["rating"] < 3][
    "review"
].str.cat(sep=" ")
# Generate wordcloud
wc = WordCloud(mask=mask, **kwargs).generate(words)
# Show figure
plt.figure(figsize=(15, 10))
plt.imshow(wc)
plt.axis("off")
plt.show()
Of the "good" and "bad" word turtles, the words that stood out the most to us were:
GOOD: easy, great, interesting, good, helpful, extra credit
BAD: difficult, worst, hard, boring, never, avoid, nothing
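Word clouds are qualitative, so to double-check which words actually dominate each group, a raw frequency tally works too. A small sketch (the stopword set and reviews here are illustrative, not our real data):

```python
import re
from collections import Counter

def top_words(reviews, stopwords, n=5):
    """Tally words across a list of reviews, skipping stopwords."""
    counts = Counter()
    for text in reviews:
        counts.update(
            w for w in re.findall(r"[a-z']+", text.lower()) if w not in stopwords
        )
    return counts.most_common(n)

# Illustrative inputs only
stopwords = {"the", "a", "is", "and", "this", "class"}
good_reviews = ["Easy class and a great professor", "Great lectures, easy tests"]
print(top_words(good_reviews, stopwords))
```

Running the same tally over our real "good" and "bad" review subsets would confirm the standout words above.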
We wanted to come up with a numeric metric that could help us better gauge the exact positivity or negativity of the review comments. It turns out that this problem can be solved with Sentiment Analysis, which uses NLP and machine learning to tokenize a series of words and quantify them into a sentiment label and confidence score summarizing the mood and tone of our reviews. As such, we used transformers, an ML library provided by Hugging Face, to create a sentiment pipeline that generates a sentiment label and score for each of our reviews. The pipeline uses DistilBERT, a pre-trained NLP model fine-tuned for sentiment analysis.
# Cache our reviews
if not os.path.exists("./src/data/reviews.csv"):
    from transformers import pipeline

    # Create sentiment pipeline
    sentiment_pipeline = pipeline("sentiment-analysis", device=0)

    def get_sentiment(review):
        # Ignore empty reviews
        if not review:
            return None
        # Feed review into pipeline and extract sentiment
        # NOTE: DistilBERT truncates input to the first 512 tokens
        sentiment = sentiment_pipeline(review, truncation=True)[0]
        return (-1 if sentiment["label"] == "NEGATIVE" else 1) * sentiment["score"]

    # Apply & save to file
    reviews_df["sentiment"] = reviews_df["review"].apply(get_sentiment)
    reviews_df.to_csv("./src/data/reviews.csv")
reviews_df.head(10)
| | name | course | review | rating | created | year | num_reviews | sentiment |
|---|---|---|---|---|---|---|---|---|
| 0 | A W. Kruglanski | PSYC489H | DO NOT TAKE PSYC489H "Motivated Social Cogniti... | 2 | 2015-09-07 18:44:00+00:00 | 2015 | 1 | -0.998593 |
| 1 | A.U. Shankar | CMSC412 | Lectures are pretty dry and difficult to follo... | 3 | 2013-01-02 21:32:00+00:00 | 2013 | 1 | -0.999379 |
| 2 | A.U. Shankar | CMSC412 | Professor: He does have a stutter, but if you ... | 3 | 2012-12-23 03:51:00+00:00 | 2012 | 1 | -0.976684 |
| 3 | A.U. Shankar | CMSC412 | This is a horrible class. The projects are imp... | 1 | 2012-10-29 00:54:00+00:00 | 2012 | 1 | -0.999185 |
| 4 | A.U. Shankar | CMSC412 | I have a lot of respect for Dr. Shankar. He is... | 5 | 2012-05-24 13:00:00+00:00 | 2012 | 1 | 0.996378 |
| 5 | A.U. Shankar | CMSC216 | Stutters. Slow lectures. Exams are exactly the... | 1 | 2017-11-19 22:26:47+00:00 | 2017 | 1 | -0.999431 |
| 6 | A.U. Shankar | CMSC216 | One of the worst lecturers I have had so far. ... | 1 | 2018-01-23 22:50:53+00:00 | 2018 | 1 | -0.999699 |
| 7 | A.U. Shankar | CMSC216 | Shankar is a nice guy if you were to ever spea... | 2 | 2018-04-11 00:07:08+00:00 | 2018 | 1 | -0.954501 |
| 8 | A.U. Shankar | CMSC216 | He is very nice if you talk to him, but if you... | 1 | 2019-10-20 23:58:06+00:00 | 2019 | 1 | -0.992549 |
| 9 | A.U. Shankar | CMSC216 | Stutters, which makes his lectures impossible ... | 1 | 2019-12-17 21:24:31+00:00 | 2019 | 1 | -0.999610 |
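Scoring reviews one at a time underutilizes the GPU; a transformers pipeline also accepts a whole list of strings and batches internally. A sketch of the batched call plus the signed-score conversion (the two result dicts below are made-up examples of the pipeline's output format, not real model output):

```python
# Batched call (hypothetical; assumes `sentiment_pipeline` and a list of review strings):
# results = sentiment_pipeline(review_list, truncation=True, batch_size=32)

def to_signed(result):
    """Collapse one {'label', 'score'} pipeline result into a signed score in [-1, 1]."""
    return (-1 if result["label"] == "NEGATIVE" else 1) * result["score"]

# Made-up results in the pipeline's output format
results = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.95},
]
scores = [to_signed(r) for r in results]
print(scores)  # [0.98, -0.95]
```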
# Merge salaries with reviews (that have sentiment now)
reviews_salaries_df = reviews_df.merge(
    salaries_df, on=["name", "year"], how="left"
).dropna(subset=["salary"])
import plotly.express as px

fig = px.scatter_3d(
    reviews_df.merge(salaries_df, on=["name", "year"], how="left"),
    x="rating",
    y="sentiment",
    z="salary",
    color="school",
    hover_data=["name"],
    labels=labels,
)
fig.show()
To put statistical numbers behind our graphs and test our hypothesis, we created linear regression models relating rating to salary.
We first tested whether professor rating alone correlates with salary.
# Create linreg model
reg = smf.ols(formula="salary ~ rating", data=reviews_salaries_df).fit()
print(reg.summary())
OLS Regression Results
==============================================================================
Dep. Variable: salary R-squared: 0.003
Model: OLS Adj. R-squared: 0.003
Method: Least Squares F-statistic: 39.79
Date: Fri, 16 Dec 2022 Prob (F-statistic): 2.91e-10
Time: 03:48:41 Log-Likelihood: -1.8619e+05
No. Observations: 15174 AIC: 3.724e+05
Df Residuals: 15172 BIC: 3.724e+05
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.029e+05 1020.900 100.841 0.000 1.01e+05 1.05e+05
rating -1680.3686 266.388 -6.308 0.000 -2202.522 -1158.215
==============================================================================
Omnibus: 2990.138 Durbin-Watson: 0.447
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6903.113
Skew: 1.119 Prob(JB): 0.00
Kurtosis: 5.431 Cond. No. 9.87
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Our R-squared value for this linear regression was extremely low (0.003), suggesting that rating alone explains essentially none of the variance in salary. Our next step was to fit a linear regression that incorporated sentiment as an additional variable.
# Create linreg model
reg = smf.ols(formula="salary ~ rating * sentiment", data=reviews_salaries_df).fit()
print(reg.summary())
OLS Regression Results
==============================================================================
Dep. Variable: salary R-squared: 0.004
Model: OLS Adj. R-squared: 0.003
Method: Least Squares F-statistic: 18.45
Date: Fri, 16 Dec 2022 Prob (F-statistic): 6.05e-12
Time: 03:48:41 Log-Likelihood: -1.8608e+05
No. Observations: 15166 AIC: 3.722e+05
Df Residuals: 15162 BIC: 3.722e+05
Df Model: 3
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept 1.013e+05 1683.988 60.138 0.000 9.8e+04 1.05e+05
rating -807.0259 428.650 -1.883 0.060 -1647.231 33.179
sentiment 858.1646 1702.689 0.504 0.614 -2479.312 4195.641
rating:sentiment -876.1318 430.408 -2.036 0.042 -1719.784 -32.480
==============================================================================
Omnibus: 2986.092 Durbin-Watson: 0.448
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6888.173
Skew: 1.118 Prob(JB): 0.00
Kurtosis: 5.428 Cond. No. 27.1
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The R-squared value was still extremely low (0.004). From our exploratory graphs, we noticed that different schools followed different trends, and the same was true across years. For this reason, we added year and school as variables in this final linear regression model.
# Create BETTER linreg model
reg = smf.ols(
    formula="salary ~ rating * sentiment * year * school", data=reviews_salaries_df
).fit()
print(reg.summary())
OLS Regression Results
==============================================================================
Dep. Variable: salary R-squared: 0.195
Model: OLS Adj. R-squared: 0.184
Method: Least Squares F-statistic: 19.14
Date: Fri, 16 Dec 2022 Prob (F-statistic): 0.00
Time: 03:48:44 Log-Likelihood: -1.8447e+05
No. Observations: 15166 AIC: 3.693e+05
Df Residuals: 14976 BIC: 3.708e+05
Df Model: 189
Covariance Type: nonrobust
==========================================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------------------
Intercept -1.585e+06 6.06e+06 -0.262 0.793 -1.35e+07 1.03e+07
school[T.ARCH] 4.285e+06 2.38e+07 0.180 0.857 -4.24e+07 5.1e+07
school[T.ARHU] -5.1e+06 6.62e+06 -0.771 0.441 -1.81e+07 7.87e+06
school[T.BMGT] -1.387e+07 7.29e+06 -1.903 0.057 -2.82e+07 4.14e+05
school[T.BSOS] 1.591e+05 6.81e+06 0.023 0.981 -1.32e+07 1.35e+07
school[T.CMNS] -1.39e+06 6.33e+06 -0.219 0.826 -1.38e+07 1.1e+07
school[T.DIT] -7.524e+07 1.11e+09 -0.068 0.946 -2.26e+09 2.11e+09
school[T.EDUC] 1.453e+07 1.55e+07 0.935 0.350 -1.59e+07 4.5e+07
school[T.ENGR] -1.389e+07 7.23e+06 -1.921 0.055 -2.81e+07 2.83e+05
school[T.EXST] -1.788e+10 1.56e+10 -1.144 0.253 -4.85e+10 1.28e+10
school[T.GRAD] 262.3035 872.643 0.301 0.764 -1448.183 1972.790
school[T.INFO] -1.025e+07 1.84e+07 -0.558 0.577 -4.63e+07 2.58e+07
school[T.IT] -2.492e+09 3.73e+10 -0.067 0.947 -7.55e+10 7.06e+10
school[T.JOUR] -1.375e+07 1.22e+07 -1.129 0.259 -3.76e+07 1.01e+07
school[T.LIBR] -3.139e+10 1.02e+11 -0.306 0.759 -2.32e+11 1.69e+11
school[T.PLCY] -7.291e+05 3.86e+07 -0.019 0.985 -7.64e+07 7.49e+07
school[T.PRES] -2.828e+08 1.67e+08 -1.697 0.090 -6.1e+08 4.39e+07
school[T.PUAF] 480.6775 1561.926 0.308 0.758 -2580.888 3542.243
school[T.SPHL] -1.763e+06 9.31e+06 -0.189 0.850 -2e+07 1.65e+07
school[T.SVPAAP] 2.46e+07 2.53e+07 0.972 0.331 -2.5e+07 7.42e+07
school[T.UGST] -2.556e+07 1.82e+07 -1.406 0.160 -6.12e+07 1.01e+07
school[T.USG] -2.264e+09 2.95e+09 -0.768 0.442 -8.04e+09 3.51e+09
school[T.VPA] -760.1330 2497.981 -0.304 0.761 -5656.482 4136.216
school[T.VPAA] -111.2615 354.374 -0.314 0.754 -805.877 583.354
school[T.VPAF] -3.257e+09 1.39e+10 -0.234 0.815 -3.06e+10 2.41e+10
school[T.VPR] 1.619e+09 4.32e+09 0.375 0.708 -6.85e+09 1.01e+10
school[T.VPSA] -1.979e+07 3.5e+07 -0.566 0.571 -8.83e+07 4.87e+07
school[T.VPUR] -887.9146 2886.380 -0.308 0.758 -6545.573 4769.744
rating 3.742e+05 1.54e+06 0.243 0.808 -2.64e+06 3.39e+06
rating:school[T.ARCH] -4.064e+06 5.99e+06 -0.679 0.497 -1.58e+07 7.67e+06
rating:school[T.ARHU] 7.987e+05 1.69e+06 0.473 0.636 -2.51e+06 4.11e+06
rating:school[T.BMGT] 3.881e+05 1.87e+06 0.208 0.835 -3.27e+06 4.05e+06
rating:school[T.BSOS] -8.178e+05 1.73e+06 -0.473 0.637 -4.21e+06 2.57e+06
rating:school[T.CMNS] -2.115e+05 1.61e+06 -0.131 0.895 -3.37e+06 2.94e+06
rating:school[T.DIT] 1.077e+07 2.23e+08 0.048 0.961 -4.26e+08 4.47e+08
rating:school[T.EDUC] -2.59e+06 4.21e+06 -0.615 0.538 -1.08e+07 5.66e+06
rating:school[T.ENGR] 3.808e+06 1.84e+06 2.072 0.038 2.05e+05 7.41e+06
rating:school[T.EXST] 3.549e+09 3.11e+09 1.140 0.254 -2.56e+09 9.65e+09
rating:school[T.GRAD] 234.5135 766.731 0.306 0.760 -1268.373 1737.400
rating:school[T.INFO] 6.638e+05 4.41e+06 0.150 0.880 -7.99e+06 9.31e+06
rating:school[T.IT] 6.851e+08 1.13e+10 0.061 0.951 -2.14e+10 2.27e+10
rating:school[T.JOUR] 4.001e+06 3.14e+06 1.275 0.202 -2.15e+06 1.02e+07
rating:school[T.LIBR] 9.165e+09 2.99e+10 0.307 0.759 -4.93e+10 6.77e+10
rating:school[T.PLCY] 9.269e+06 9.84e+06 0.942 0.346 -1e+07 2.86e+07
rating:school[T.PRES] 7.003e+07 3.43e+07 2.042 0.041 2.8e+06 1.37e+08
rating:school[T.PUAF] 56.6003 176.195 0.321 0.748 -288.763 401.963
rating:school[T.SPHL] -1.721e+06 2.42e+06 -0.711 0.477 -6.47e+06 3.02e+06
rating:school[T.SVPAAP] -5.7e+06 6.06e+06 -0.940 0.347 -1.76e+07 6.19e+06
rating:school[T.UGST] 4.154e+06 4.47e+06 0.929 0.353 -4.61e+06 1.29e+07
rating:school[T.USG] -1.132e+10 1.47e+10 -0.768 0.442 -4.02e+10 1.76e+10
rating:school[T.VPA] -92.7760 314.108 -0.295 0.768 -708.466 522.914
rating:school[T.VPAA] -78.4276 258.447 -0.303 0.762 -585.016 428.161
rating:school[T.VPAF] 6.486e+08 2.79e+09 0.232 0.816 -4.82e+09 6.12e+09
rating:school[T.VPR] -5.665e+08 1.44e+09 -0.393 0.694 -3.39e+09 2.26e+09
rating:school[T.VPSA] 2.145e+06 7.73e+06 0.277 0.781 -1.3e+07 1.73e+07
rating:school[T.VPUR] 43.7590 136.082 0.322 0.748 -222.978 310.496
sentiment -7.718e+06 6.16e+06 -1.252 0.210 -1.98e+07 4.36e+06
sentiment:school[T.ARCH] -1.295e+07 2.39e+07 -0.541 0.588 -5.99e+07 3.4e+07
sentiment:school[T.ARHU] 6.359e+06 6.72e+06 0.946 0.344 -6.82e+06 1.95e+07
sentiment:school[T.BMGT] -4.29e+06 7.4e+06 -0.579 0.562 -1.88e+07 1.02e+07
sentiment:school[T.BSOS] 6.872e+06 6.91e+06 0.994 0.320 -6.68e+06 2.04e+07
sentiment:school[T.CMNS] 7.835e+06 6.44e+06 1.217 0.224 -4.79e+06 2.05e+07
sentiment:school[T.DIT] -8.792e+07 1.12e+09 -0.079 0.937 -2.28e+09 2.1e+09
sentiment:school[T.EDUC] 1.808e+07 1.59e+07 1.138 0.255 -1.31e+07 4.92e+07
sentiment:school[T.ENGR] 4.598e+06 7.36e+06 0.625 0.532 -9.82e+06 1.9e+07
sentiment:school[T.EXST] -1.794e+10 1.56e+10 -1.149 0.250 -4.85e+10 1.27e+10
sentiment:school[T.GRAD] -5.8503 57.069 -0.103 0.918 -117.714 106.013
sentiment:school[T.INFO] 1.923e+07 1.86e+07 1.035 0.301 -1.72e+07 5.57e+07
sentiment:school[T.IT] -2.47e+09 3.73e+10 -0.066 0.947 -7.57e+10 7.07e+10
sentiment:school[T.JOUR] 2.431e+07 1.24e+07 1.953 0.051 -9.08e+04 4.87e+07
sentiment:school[T.LIBR] 2.937e+10 9.58e+10 0.306 0.759 -1.58e+11 2.17e+11
sentiment:school[T.PLCY] 8.612e+05 3.89e+07 0.022 0.982 -7.54e+07 7.71e+07
sentiment:school[T.PRES] -1.026e+08 1.67e+08 -0.615 0.539 -4.29e+08 2.24e+08
sentiment:school[T.PUAF] 21.2635 31.234 0.681 0.496 -39.960 82.487
sentiment:school[T.SPHL] 3.774e+06 9.39e+06 0.402 0.688 -1.46e+07 2.22e+07
sentiment:school[T.SVPAAP] 3.122e+07 2.64e+07 1.184 0.236 -2.05e+07 8.29e+07
sentiment:school[T.UGST] -5.103e+06 1.83e+07 -0.278 0.781 -4.1e+07 3.08e+07
sentiment:school[T.USG] 2.264e+09 2.95e+09 0.768 0.442 -3.51e+09 8.04e+09
sentiment:school[T.VPA] 5.4589 21.019 0.260 0.795 -35.741 46.659
sentiment:school[T.VPAA] 2.8087 7.391 0.380 0.704 -11.679 17.296
sentiment:school[T.VPAF] -3.262e+09 1.39e+10 -0.234 0.815 -3.06e+10 2.41e+10
sentiment:school[T.VPR] -1.88e+09 4.38e+09 -0.429 0.668 -1.05e+10 6.7e+09
sentiment:school[T.VPSA] -3.063e+07 3.53e+07 -0.869 0.385 -9.97e+07 3.85e+07
sentiment:school[T.VPUR] -5.6744 8.584 -0.661 0.509 -22.500 11.151
rating:sentiment 1.386e+06 1.56e+06 0.887 0.375 -1.68e+06 4.45e+06
rating:sentiment:school[T.ARCH] 4.635e+06 6.01e+06 0.771 0.441 -7.15e+06 1.64e+07
rating:sentiment:school[T.ARHU] -1.413e+06 1.71e+06 -0.826 0.409 -4.77e+06 1.94e+06
rating:sentiment:school[T.BMGT] 2.073e+06 1.89e+06 1.098 0.272 -1.63e+06 5.77e+06
rating:sentiment:school[T.BSOS] -7.294e+05 1.75e+06 -0.416 0.677 -4.16e+06 2.71e+06
rating:sentiment:school[T.CMNS] -1.616e+06 1.63e+06 -0.990 0.322 -4.82e+06 1.58e+06
rating:sentiment:school[T.DIT] 1.902e+07 2.24e+08 0.085 0.932 -4.2e+08 4.58e+08
rating:sentiment:school[T.EDUC] -3.796e+06 4.26e+06 -0.891 0.373 -1.21e+07 4.56e+06
rating:sentiment:school[T.ENGR] -1.026e+06 1.86e+06 -0.550 0.582 -4.68e+06 2.63e+06
rating:sentiment:school[T.EXST] 3.614e+09 3.13e+09 1.153 0.249 -2.53e+09 9.76e+09
rating:sentiment:school[T.GRAD] -5.8632 21.275 -0.276 0.783 -47.565 35.839
rating:sentiment:school[T.INFO] -4.461e+06 4.42e+06 -1.010 0.312 -1.31e+07 4.2e+06
rating:sentiment:school[T.IT] 6.851e+08 1.13e+10 0.061 0.952 -2.14e+10 2.28e+10
rating:sentiment:school[T.JOUR] -6.554e+06 3.18e+06 -2.058 0.040 -1.28e+07 -3.12e+05
rating:sentiment:school[T.LIBR] -8.765e+09 2.85e+10 -0.307 0.759 -6.47e+10 4.72e+10
rating:sentiment:school[T.PLCY] -7.2e+06 9.74e+06 -0.740 0.460 -2.63e+07 1.19e+07
rating:sentiment:school[T.PRES] -7.202e+06 3.43e+07 -0.210 0.834 -7.44e+07 6e+07
rating:sentiment:school[T.PUAF] -0.0674 0.172 -0.392 0.695 -0.405 0.270
rating:sentiment:school[T.SPHL] 9153.5626 2.42e+06 0.004 0.997 -4.74e+06 4.76e+06
rating:sentiment:school[T.SVPAAP] -4.292e+06 6.27e+06 -0.684 0.494 -1.66e+07 8e+06
rating:sentiment:school[T.UGST] 6.711e+05 4.49e+06 0.149 0.881 -8.14e+06 9.48e+06
rating:sentiment:school[T.USG] 1.132e+10 1.47e+10 0.768 0.442 -1.76e+10 4.02e+10
rating:sentiment:school[T.VPA] -0.1162 0.384 -0.303 0.762 -0.868 0.636
rating:sentiment:school[T.VPAA] -0.0016 0.011 -0.140 0.888 -0.024 0.021
rating:sentiment:school[T.VPAF] 6.544e+08 2.79e+09 0.235 0.814 -4.81e+09 6.11e+09
rating:sentiment:school[T.VPR] 6.155e+08 1.45e+09 0.424 0.672 -2.23e+09 3.46e+09
rating:sentiment:school[T.VPSA] 5.03e+06 7.8e+06 0.645 0.519 -1.03e+07 2.03e+07
rating:sentiment:school[T.VPUR] -0.0024 0.177 -0.013 0.989 -0.348 0.344
year 844.2779 3000.697 0.281 0.778 -5037.456 6726.011
year:school[T.ARCH] -2151.4849 1.18e+04 -0.182 0.855 -2.53e+04 2.1e+04
year:school[T.ARHU] 2504.3084 3278.295 0.764 0.445 -3921.551 8930.168
year:school[T.BMGT] 6880.3313 3611.573 1.905 0.057 -198.794 1.4e+04
year:school[T.BSOS] -87.8066 3372.337 -0.026 0.979 -6698.000 6522.387
year:school[T.CMNS] 675.7271 3137.588 0.215 0.829 -5474.329 6825.783
year:school[T.DIT] 3.721e+04 5.51e+05 0.068 0.946 -1.04e+06 1.12e+06
year:school[T.EDUC] -7210.6741 7701.452 -0.936 0.349 -2.23e+04 7885.114
year:school[T.ENGR] 6896.3813 3582.354 1.925 0.054 -125.472 1.39e+04
year:school[T.EXST] 8.882e+06 7.76e+06 1.144 0.253 -6.34e+06 2.41e+07
year:school[T.GRAD] -1.8856 4.178 -0.451 0.652 -10.075 6.304
year:school[T.INFO] 5056.2179 9098.445 0.556 0.578 -1.28e+04 2.29e+04
year:school[T.IT] 1.237e+06 1.85e+07 0.067 0.947 -3.5e+07 3.75e+07
year:school[T.JOUR] 6808.7787 6033.950 1.128 0.259 -5018.501 1.86e+04
year:school[T.LIBR] 1.527e+07 4.98e+07 0.307 0.759 -8.24e+07 1.13e+08
year:school[T.PLCY] 346.9105 1.91e+04 0.018 0.986 -3.71e+04 3.78e+04
year:school[T.PRES] 1.399e+05 8.25e+04 1.697 0.090 -2.17e+04 3.02e+05
year:school[T.PUAF] -37.5443 38.995 -0.963 0.336 -113.979 38.890
year:school[T.SPHL] 857.9856 4609.919 0.186 0.852 -8178.020 9893.991
year:school[T.SVPAAP] -1.218e+04 1.25e+04 -0.972 0.331 -3.67e+04 1.24e+04
year:school[T.UGST] 1.262e+04 9001.457 1.402 0.161 -5020.831 3.03e+04
year:school[T.USG] 1.12e+06 1.46e+06 0.768 0.443 -1.74e+06 3.98e+06
year:school[T.VPA] -133.6173 888.135 -0.150 0.880 -1874.471 1607.237
year:school[T.VPAA] 15.0971 30.792 0.490 0.624 -45.258 75.452
year:school[T.VPAF] 1.613e+06 6.9e+06 0.234 0.815 -1.19e+07 1.51e+07
year:school[T.VPR] -8.007e+05 2.14e+06 -0.375 0.708 -4.99e+06 3.39e+06
year:school[T.VPSA] 9794.6141 1.73e+04 0.566 0.572 -2.41e+04 4.37e+04
year:school[T.VPUR] -45.7452 125.694 -0.364 0.716 -292.120 200.630
rating:year -188.4963 762.540 -0.247 0.805 -1683.168 1306.176
rating:year:school[T.ARCH] 2015.2534 2966.200 0.679 0.497 -3798.862 7829.369
rating:year:school[T.ARHU] -393.0529 836.999 -0.470 0.639 -2033.673 1247.567
rating:year:school[T.BMGT] -188.7083 924.301 -0.204 0.838 -2000.452 1623.036
rating:year:school[T.BSOS] 406.7899 857.459 0.474 0.635 -1273.935 2087.515
rating:year:school[T.CMNS] 109.2402 797.346 0.137 0.891 -1453.656 1672.137
rating:year:school[T.DIT] -5322.8543 1.1e+05 -0.048 0.961 -2.21e+05 2.11e+05
rating:year:school[T.EDUC] 1285.3286 2086.327 0.616 0.538 -2804.128 5374.785
rating:year:school[T.ENGR] -1885.6324 910.582 -2.071 0.038 -3670.485 -100.780
rating:year:school[T.EXST] -1.763e+06 1.55e+06 -1.140 0.254 -4.8e+06 1.27e+06
rating:year:school[T.GRAD] -5.2030 13.597 -0.383 0.702 -31.854 21.448
rating:year:school[T.INFO] -325.4472 2184.455 -0.149 0.882 -4607.247 3956.353
rating:year:school[T.IT] -3.401e+05 5.59e+06 -0.061 0.951 -1.13e+07 1.06e+07
rating:year:school[T.JOUR] -1981.5227 1554.505 -1.275 0.202 -5028.543 1065.498
rating:year:school[T.LIBR] -4.484e+06 1.46e+07 -0.307 0.759 -3.31e+07 2.41e+07
rating:year:school[T.PLCY] -4580.9211 4873.209 -0.940 0.347 -1.41e+04 4971.166
rating:year:school[T.PRES] -3.465e+04 1.7e+04 -2.042 0.041 -6.79e+04 -1387.273
rating:year:school[T.PUAF] 11.6218 11.549 1.006 0.314 -11.015 34.258
rating:year:school[T.SPHL] 854.2979 1199.301 0.712 0.476 -1496.478 3205.074
rating:year:school[T.SVPAAP] 2821.7058 3002.397 0.940 0.347 -3063.359 8706.771
rating:year:school[T.UGST] -2052.1365 2214.397 -0.927 0.354 -6392.625 2288.352
rating:year:school[T.USG] 5.6e+06 7.3e+06 0.768 0.443 -8.7e+06 1.99e+07
rating:year:school[T.VPA] 65.9324 417.969 0.158 0.875 -753.338 885.203
rating:year:school[T.VPAA] 6.4811 33.095 0.196 0.845 -58.389 71.351
rating:year:school[T.VPAF] -3.213e+05 1.38e+06 -0.232 0.816 -3.03e+06 2.39e+06
rating:year:school[T.VPR] 2.802e+05 7.13e+05 0.393 0.694 -1.12e+06 1.68e+06
rating:year:school[T.VPSA] -1064.7484 3827.557 -0.278 0.781 -8567.229 6437.732
rating:year:school[T.VPUR] -57.8329 334.576 -0.173 0.863 -713.643 597.977
sentiment:year 3832.4920 3053.837 1.255 0.210 -2153.402 9818.386
sentiment:year:school[T.ARCH] 6408.2472 1.19e+04 0.540 0.589 -1.68e+04 2.97e+04
sentiment:year:school[T.ARHU] -3157.9669 3331.082 -0.948 0.343 -9687.295 3371.361
sentiment:year:school[T.BMGT] 2112.8755 3668.651 0.576 0.565 -5078.130 9303.881
sentiment:year:school[T.BSOS] -3412.9629 3425.704 -0.996 0.319 -1.01e+04 3301.837
sentiment:year:school[T.CMNS] -3891.2045 3191.181 -1.219 0.223 -1.01e+04 2363.901
sentiment:year:school[T.DIT] 4.348e+04 5.53e+05 0.079 0.937 -1.04e+06 1.13e+06
sentiment:year:school[T.EDUC] -8961.8104 7872.301 -1.138 0.255 -2.44e+04 6468.863
sentiment:year:school[T.ENGR] -2281.2498 3645.376 -0.626 0.531 -9426.633 4864.133
sentiment:year:school[T.EXST] 8.91e+06 7.75e+06 1.149 0.250 -6.29e+06 2.41e+07
sentiment:year:school[T.GRAD] 9.2325 27.594 0.335 0.738 -44.855 63.320
sentiment:year:school[T.INFO] -9536.4298 9200.686 -1.036 0.300 -2.76e+04 8498.040
sentiment:year:school[T.IT] 1.226e+06 1.85e+07 0.066 0.947 -3.51e+07 3.76e+07
sentiment:year:school[T.JOUR] -1.205e+04 6165.452 -1.955 0.051 -2.41e+04 32.767
sentiment:year:school[T.LIBR] -1.487e+07 4.85e+07 -0.306 0.759 -1.1e+08 8.02e+07
sentiment:year:school[T.PLCY] -449.7502 1.93e+04 -0.023 0.981 -3.82e+04 3.73e+04
sentiment:year:school[T.PRES] 5.075e+04 8.25e+04 0.615 0.538 -1.11e+05 2.12e+05
sentiment:year:school[T.PUAF] 24.7884 37.502 0.661 0.509 -48.719 98.296
sentiment:year:school[T.SPHL] -1878.0146 4650.369 -0.404 0.686 -1.1e+04 7237.278
sentiment:year:school[T.SVPAAP] -1.547e+04 1.31e+04 -1.185 0.236 -4.11e+04 1.01e+04
sentiment:year:school[T.UGST] 2525.6234 9079.922 0.278 0.781 -1.53e+04 2.03e+04
sentiment:year:school[T.USG] -1.12e+06 1.46e+06 -0.767 0.443 -3.98e+06 1.74e+06
sentiment:year:school[T.VPA] 158.6193 719.610 0.220 0.826 -1251.904 1569.142
sentiment:year:school[T.VPAA] 10.4482 12.005 0.870 0.384 -13.083 33.979
sentiment:year:school[T.VPAF] 1.616e+06 6.91e+06 0.234 0.815 -1.19e+07 1.52e+07
sentiment:year:school[T.VPR] 9.295e+05 2.16e+06 0.429 0.668 -3.31e+06 5.17e+06
sentiment:year:school[T.VPSA] 1.518e+04 1.75e+04 0.869 0.385 -1.91e+04 4.94e+04
sentiment:year:school[T.VPUR] 22.0298 135.518 0.163 0.871 -243.602 287.662
rating:sentiment:year -688.1618 774.219 -0.889 0.374 -2205.725 829.402
rating:sentiment:year:school[T.ARCH] -2293.7222 2978.574 -0.770 0.441 -8132.092 3544.648
rating:sentiment:year:school[T.ARHU] 701.0567 847.883 0.827 0.408 -960.897 2363.010
rating:sentiment:year:school[T.BMGT] -1024.9489 935.191 -1.096 0.273 -2858.038 808.141
rating:sentiment:year:school[T.BSOS] 362.6377 868.242 0.418 0.676 -1339.223 2064.498
rating:sentiment:year:school[T.CMNS] 802.1658 809.009 0.992 0.321 -783.591 2387.923
rating:sentiment:year:school[T.DIT] -9407.6274 1.11e+05 -0.085 0.932 -2.26e+05 2.08e+05
rating:sentiment:year:school[T.EDUC] 1882.3208 2111.954 0.891 0.373 -2257.367 6022.008
rating:sentiment:year:school[T.ENGR] 508.7386 923.571 0.551 0.582 -1301.574 2319.051
rating:sentiment:year:school[T.EXST] -1.795e+06 1.56e+06 -1.153 0.249 -4.85e+06 1.26e+06
rating:sentiment:year:school[T.GRAD] -7.9048 28.007 -0.282 0.778 -62.802 46.992
rating:sentiment:year:school[T.INFO] 2211.9624 2186.342 1.012 0.312 -2073.536 6497.460
rating:sentiment:year:school[T.IT] -3.401e+05 5.6e+06 -0.061 0.952 -1.13e+07 1.06e+07
rating:sentiment:year:school[T.JOUR] 3249.4439 1577.484 2.060 0.039 157.382 6341.506
rating:sentiment:year:school[T.LIBR] 4.405e+06 1.43e+07 0.307 0.759 -2.37e+07 3.25e+07
rating:sentiment:year:school[T.PLCY] 3566.6241 4819.608 0.740 0.459 -5880.398 1.3e+04
rating:sentiment:year:school[T.PRES] 3558.7343 1.7e+04 0.210 0.834 -2.97e+04 3.68e+04
rating:sentiment:year:school[T.PUAF] -11.8609 10.966 -1.082 0.279 -33.355 9.633
rating:sentiment:year:school[T.SPHL] -3.1089 1199.397 -0.003 0.998 -2354.075 2347.857
rating:sentiment:year:school[T.SVPAAP] 2126.8657 3104.023 0.685 0.493 -3957.398 8211.130
rating:sentiment:year:school[T.UGST] -333.1963 2225.194 -0.150 0.881 -4694.850 4028.457
rating:sentiment:year:school[T.USG] -5.601e+06 7.3e+06 -0.767 0.443 -1.99e+07 8.7e+06
rating:sentiment:year:school[T.VPA] -73.4847 385.215 -0.191 0.849 -828.553 681.584
rating:sentiment:year:school[T.VPAA] 18.3012 39.802 0.460 0.646 -59.716 96.319
rating:sentiment:year:school[T.VPAF] -3.241e+05 1.38e+06 -0.235 0.814 -3.03e+06 2.38e+06
rating:sentiment:year:school[T.VPR] -3.044e+05 7.18e+05 -0.424 0.672 -1.71e+06 1.1e+06
rating:sentiment:year:school[T.VPSA] -2493.1216 3862.732 -0.645 0.519 -1.01e+04 5078.306
rating:sentiment:year:school[T.VPUR] -74.2660 340.336 -0.218 0.827 -741.365 592.833
==============================================================================
Omnibus: 2654.981 Durbin-Watson: 0.461
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6882.544
Skew: 0.962 Prob(JB): 0.00
Kurtosis: 5.681 Cond. No. 1.25e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.19e-20. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
This R-squared value (0.195), although not terribly close to 1, is much higher than in our previous models. However, the p-values for nearly all coefficients are extremely high, and the enormous condition number in the notes warns of strong multicollinearity, so no individual coefficient here can be trusted on its own.
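One way to quantify that multicollinearity is the variance inflation factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the others (statsmodels ships this as `variance_inflation_factor`). A hand-rolled numpy sketch on illustrative data, not our real design matrix:

```python
import numpy as np

def vif(X):
    """Variance inflation factor 1 / (1 - R^2_j) for each column of X.
    Columns are centered first, so no explicit intercept is needed."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # R^2 of regressing column j on the remaining columns
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        factors.append(1.0 / (1.0 - r2))
    return factors

# Illustrative data: columns 0 and 2 nearly collinear, column 1 independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.01 * rng.normal(size=500)
factors = vif(np.column_stack([a, b, c]))
```

VIF values well above 10 flag predictors whose coefficients (like many of ours above) cannot be interpreted individually.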
# Scatter of salary vs. sentiment for each review
reviews_salaries_df.plot(kind="scatter", x="salary", y="sentiment")
Unfortunately, after performing our analysis, we were unable to find any statistically significant evidence to support our original hypothesis. We believe several confounding factors could have masked any correlation between a professor's reviews and how much they make. For instance, reviews are self-selected (students with strong opinions are the most likely to post them), ratings and sentiment vary by school and by year, and salary is determined by many factors beyond teaching quality.
Future work on this subject should take student reviews with a grain of salt for the reasons above. Instead, school-backed metrics, such as course evaluations and gradebook data, should be taken into consideration, as these data sources are generally less self-selected and less biased.